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Abstract 

This work deals with parallel optimization of expensive objective func¬ 
tions which are modelled as sample realizations of Gaussian processes. 
The study is formalized as a Bayesian optimization problem, or continu¬ 
ous multi-armed bandit problem, where a batch of q > 0 arms is pulled 
in parallel at each iteration. Several algorithms have been developed for 
choosing batches by trading off exploitation and exploration. As of to¬ 
day, the maximum Expected Improvement (El) and Upper Confidence 
Bound (UCB) selection rules appear as the most prominent approaches 
for batch selection. Here, we build upon recent work on the multipoint 
Expected Improvement criterion, for which an analytic expansion relying 
on Tallis’ formula was recently established. The computational burden of 
this selection rule being still an issue in application, we derive a closed- 
form expression for the gradient of the multipoint Expected Improvement, 
which aims at facilitating its maximization using gradient-based ascent al¬ 
gorithms. Substantial computational savings are shown in application. In 
addition, our algorithms are tested numerically and compared to state- 
of-the-art UCB-based batch-sequential algorithms. Combining starting 
designs relying on UCB with gradient-based El local optimization finally 
appears as a sound option for batch design in distributed Gaussian Pro¬ 
cess optimization. 

Keywords : Bayesian Optimization, Batch-sequential design, CP, UCB. 
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1 Introduction 


Global optimization of deterministic functions under a drastically limited eval¬ 
uation budget is a topic of growing interest with important industrial applica¬ 
tions. Dealing with such expensive black-box simulators is typically addressed 
through the introduction of surrogate models that are used both for recon¬ 
structing the objective function and guiding parsimonious evaluation strate¬ 
gies. This approach is used in various scientific communities and referred to as 
Bayesian optimization, but also as kriging-based or multi-armed bandit opti¬ 
mization mum mmmm- Among such Gaussian process optimization 
methods, two concepts of algorithm relying on sequential maximization of in¬ 
fill sampling criteria are particularly popular in the literature. In the EGO 
algorithm of [TBj, the sequence of decisions (of where to evaluate the objective 
function at each iteration) is guided by the Expected Improvement (El) criterion 
)19] . which is known to be one-step lookahead optimal jT4j. On the other hand, 
the Upper Confidence Bound (UCB) algorithm |1] maximizes sequentially a well- 
chosen kriging quantile, that is, a quantile of the pointwise posterior Gaussian 
process distribution. Similarly to El [23J [B], the consistency of the algorithm 
has been established and rates of convergence have been obtained [23]. 

Recently, different methods inspired from the two latter algorithms have 
been proposed to deal with the typical case where q > 1 CPUs are available. 
Such synchronous distributed methods provide at each iteration a batch of q 
points which can be evaluated in parallel. For instance, m generalizes the 
UCB algorithm to a batch-sequential version by maximizing kriging quantiles 
and assuming dummy responses equal to the posterior mean of the Gaussian 
process. This approach can be compared with the so-called Kriging Believer 
strategy of m where each batch is obtained by sequentially maximizing the 
one-point El under the assumption that the previously chosen points have a 
response equal to their Kriging mean. Originally, the strategies suggested in 
m were introduced to cope with the difficulty to evaluate and maximize the 
multipoint Expected Improvement (g-EI) )22j . which is the generalization of 
El known to be one-batch lookahead optimal [3 H4] , One of the bottlenecks 
for g-EI maximization was that it was until recently evaluated through Monte- 
Carlo simulations EE a reason that motivated m to propose a stochastic 
gradient algorithm for its maximization. Now, [8] established a closed-form 
expression enabling to compute g-EI at any batch of q points without appealing 
to Monte-Carlo simulations. However, the computational complexity involved 
to compute the criterion is still high and quickly grows with q. Besides, little 
has been published about the difficult maximization of the g-EI itself, which 
is an optimization problem in dimension qd , where d is the number of input 
variables. 

In this work, we contribute to the latter problem by giving an analytical 
gradient of g-EI, in the space of dimension qd. Such a gradient is meant to 
simplify the local maximization of g-EI using gradient-based ascent algorithms. 
Closed-form expressions of g-EI and its gradient have been implemented in the 
DiceOptim R package ED, together with a multistart BFGS algorithm for max- 
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imizing g-EI. In addition, we suggest to use results of the BUCB algorithm as 
initial batches in multistart gradient-based ascents. These starting batches are 
shown to yield good local optima for gr-EI. This article is organized as fol¬ 
lows. Section [2] quickly recalls the basics of Gaussian process modeling and the 
closed-form expression of g-EI obtained in [5]. Section [3] details the analytical 
q-Fl gradient. Finally, numerical experiments comparing the performances of 
the g-EI maximization-based strategy and the BUCB algorithms are provided 
and discussed in Section O For readability and conciseness, the most technical 
details about g-EI gradient calculation are sent in Appendix. 


2 General Context 

Let / : x £ D C t d —> R be a real-valued function defined on a compact subset 
D of R d ,d > 1. Throughout this article, we assume that we dispose of a set 
of n evaluations of /, A n = (x 1:n := {x 1 . ,x n },y 1:n = (f(x i),...,/(®„)) T ), 
and that our goal is to evaluate / at well-chosen batches of q points in order to 
globally maximize it. Following each batch of evaluations, we observe q deter¬ 
ministic scalar responses, or rewards, y n +i = f( x n+ 1 ), • •., y n + q = f( x n+q)- We 
use past observations in order to carefully choose the next q observation loca¬ 
tions, aiming in the end to minimize the one-step lookahead regret fix*) —t n +q , 
where x* is a maximizer of / and tj = ma 2 tj = i l „. t i(f(xj)). In this section, we 
first define the Gaussian process (GP) surrogate model used to make the de¬ 
cisions. Then we introduce the g-EI which is the optimal one-batch lookahead 
criterion (see, e.g., QT] [Ti T2 for a definition and [HI 17] for a proof). 

2.1 Gaussian process modeling 

The objective function / is a priori assumed to be a sample from a Gaussian 
process Y ~ QV(y,C), where /x(-) and C(-,-) are respectively the mean and 
covariance function of Y. At fixed y(-) and C(-, •), conditioning Y on the set of 
observations A n yields a GP posterior Y{x)\A n ~ GV{y,n,C n ) with: 

y n (x) = n(x) + c n {x) T C~ 1 {y l . n - n(x i : „)), and (1) 

Cn(x,x ) — (j (x^ x ) c n (x) C n c n (x ), (2) 

where c n (x) = (C(x,Xi))i<i< n , and C n = {C(xi,Xj))i< it j< n . Note that, in 
realistic application settings, the mean and the covariance y and C of the prior 
are assumed to depend on several parameters which require to be estimated. 
The results presented in this article and their implementations in the R pack¬ 
age DiceOptim are compatible with this more general case. More detail about 
Equations [Q [5] with or without trend and covariance parameter estimation can 
be found in m and is omitted here for conciseness. 
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2.2 The Multipoint Expected Improvement criterion 

The Multipoint Expected Improvement (g-EI) selection rule consists in max¬ 
imizing, over all possible batches of q points, the following criterion, which 
depends on a batch X = (jE n+ i,..., x n+q ) £ D q : 

EI(X) = E [(maxF(X) — T n ) + | A n \, (3) 

where (•)+ = max(-, 0), and the threshold T n is the currently observed maximum 
ofY, i.e. T n = maxi<j<„ Y(xj). Recalling that Y(X)\A n ~ J\f(n n (X), C n (X, X)), 
and denoting Y ( X ) = (Yi,..., Y q ) T , an analytic expression of g-EI at locations 
X over any threshold Tel can be found in [S] and is reproduced here : 


E1(X) = £ (m k - T)$ g)S (*o ( 


k =1 



+ XXWK’) 





(4) 


where p a 2 (') and are respectively the density function of the centered 

normal distribution with variance a 2 and the p-variate cumulative distribution 
function (CDF) of the centered normal distribution with covariance r ; m = 
E(Y(X)|M„) and E = cov(F(X)|M„) are the conditional mean vector and 
covariance matrix of Y(X) ; m^ and 1 < k < q, are the conditional 

mean vector and covariance matrix of the affine transformation of Y(X), Z ^ = 
lWy(Jf) + b w , defined as Zf ] := Yj for j ± k and zf ] := T-Y k ; and finally, 
for (fe, i) £ {1,..., g} 2 , rn'yp and E^ are the mean vector and covariance matrix 

of the Gaussian vector (Z^)\Z^ = 0), the index — i meaning that the i th 
component is removed. 


3 Gradient of the multipoint Expected Improve¬ 
ment 

In this section, we provide an analytical formula for the gradient of q-EI. Getting 
such formula requires to carefully analyze the dependence of g-EI written in 
Eq. gD on the batch locations X £ K. 9Xd . This dependence is summarized in 
Fig. □ and exhibits many chaining relations. In the forthcoming multivariate 
calculations, we use the following notations. Given two Banach spaces E and 
F, and a differentiable function g : E —»• F, the differential of g at point x, 
written d x [g\ : E F, is the bounded linear map that best approximate g 
in the neighborhood of x. In the case where E = M. p and F = R, it is well 
known that Wi £ E,d x [g](h) = (V g(x),h). More generally the differential 
can be written in terms of Jacobian matrices, matrix derivatives and/or matrix 
scalar products where E and/or F are R p or M. pxp . To simplify notations and 
handle the different indices in Eq. ©, we fix the indices i and k and focus on 
differentiating the function EI (fc ^^, standing for the generic term of the double 
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Figure 1: Link between the different terms of Eq. © and the batch of points 

X 


(fc)M 


by noticing 


sums in Eq. ©■ We can perform the calculation of dx I El 

than EI^^ can be rewritten using the functions gj, 1 < j < 8 defined on Fig. [T] 
as follows: 


E I«« _ ( mfc ~ T) ■ g 7 o G + g 4 o g 2 ■ g 5 ° G ■ g 8 o g 6 o G, 


( 5 ) 


where G = {g 8 o g 7l g 4 o g 2 ), o is the composition operator and • the multipli¬ 
cation operator. The differentiation then consists in applying classical differ¬ 
entiation formulas for products and compositions to Eq. ( 0 . Proposition [3] 
summarizes the results. For conciseness, the formulae of the differentials in¬ 
volved in Eq. © are justified in the Appendix. The calculations notably rely 
on the differential of a normal cumulative distribution functions with respect to 
its covariance matrix obtained via Plackett’s formula © 

Proposition 1. The differential of the multipoint Expected Inmprovement cri¬ 
terion of Eq. © is given by dx [El] = Y^k=i X^=i dx EI (fc )W , with 


dx 


El( fc )(0 


dx [mk] ■ g 7 ° G + (rrik — T) . d G ( X ) [di] ° dx [G] 


( 6 ) 


+ d g2 (x) M ° d x [. 92 ] ■ 95 0 G . 9s o g 6 o G 
+ 94 ° 92 ■ d G(X ) [ 95 ] 0 d x [G] . 9s o g 6 o G 
+ 94 0 92 ■ 95 0 G . d ge (G(x)) bs] 0 d G (x) [96] 0 dx [G], 


where the g 7 ’s are the functions introduced in Fig. |7J The gj’s and their respec¬ 
tive differentials are as follow : 

• 9l ■' X G D q —» 9l(X) = ( : fl n (, x j))l<j<q € R 9 , 
dx [ 91 ] (H) = 

with V/i n (sj) = V/x ( Xj ) + ( dCn ^ ] ) G" 1 (y 1 :n -n(x 1:n )). 

\ / 1 <Z<a 


• 92 : X G D q ->• 92(X) = (C n (xj, xe))i<j } e< q G <S 9 + . <S 9 + is the set of 
q x q positive definite matrices. 
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dx [,g 2 ] (H) = ((v x C n (a+ (v x C n (x e , Xj ), Hj^)) 
with V x C n (x,x) = V x C(x, x) - ( 8 c g^ ) ) Cf^c^x). 

• G : X -> (m( fe ),E«),, d x [G] = (L(*>d x [gi] ,L< fc >d x [g 2 ]T (fc)T ). 

• 37 .* (a,T) G x G 

dcfjc) H (fc.J?) = (/ l ,V a! $, iSW (mW))+tr(ffV s $ 5 , EW (m( fc ))). V^ecw 
and V E $ 9iE (k) are £/ie gradient of the multivariate Gaussian CDF with re¬ 
spect to x and to the covariance matrix, given in appendix. 


• g 4 ; E -t s (fc) , d 92(X ) M (H) = lMhlW. 

• <75 : (a, r) ePx S+ + ->• ¥>r«(a») 6 K, 
d>G(x) [55] (^,-ff) = ( 

• g 6 ; (mW,E( fc )) G 1 


‘-hi + \ 

(if - r~) H • 

5 ++ -*■ 

s [?)> 

(h-i- 

hi ^(fc) 1 

V (fc)“-M 1 


(fc) rj (fc) 

»i g » V.(fc) K „ 

E (fc )2 “M E (fc) M ’ 


■ ^ (t)T 


yu'W yu 

,(k )2 -*d 




v (k) v (fc) -<.* -*•* 


• 38 - (n,r) G ' x — t’ $g_i ; r(ei) G K., 

d 96 (G(x)) M = (A, (mW)} + tr(ffV E $ 5jE < fc) (m^)). 

The gradient of 3 -El, relying on Eq. ([ 6 ]) is implemented in the version 1.5 of 
the DiceOptim R package 0 , together with a gradient-based local optimization 
algorithm. In the next section, we show that the analytical computation of 
the gradient offers substantial computational savings compared to numerical 
computation based on a finite-difference scheme. In addition, we investigate the 
performances of the batch-sequential EGO algorithm consisting in sequentially 
maximizing g-EI, and we compare it with the BUCB algorithm of m- 


4 Numerical tests 

4.1 Computation time 

In this section, we illustrate the benefits - in terms of computation time - of 
using the analytical gradient formula of Section [3] We compare computation 
times of gradients computed analytically and numerically, through finite differ¬ 
ences schemes. It is important to note that the computation of both 3 -EI and 
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•hg-3 

$q-2 

1 

‘Ih, 

Total 

analytic g-EI 

0 

0 

q z 

q 

0(q 2 ) 

finite differences gradient 

0 

0 

q(d+ 1) q 2 

q(d+ 1) q 

0(dq 3 ) 

analytic gradient 

„2 9<9-l) 

^ 2 

q q{ - q - 1] + q 3 

q 2 + 2 q 2 

q 

0(q 4 ) 


Table 1: Total number of calls to the CDF of the multivariate Gaussian distri¬ 
bution for computing q -El or its gradient for a function with d input variables. 
The last column gives the computational complexity as a function of q and d. 



2 4 6 8 10 

Dimension 


Figure 2: Ratio between computation 
times of the numerical and analytical 
gradient of q-EI as a function of the di¬ 
mension d and the batch size q. The 
hatched area indicates a ratio below 1. 



Number of batch selections 


Figure 3: Logarithm of the average 
(plain lines) and 95% quantile (dotted 
lines) of the regret for three different 
batch-sequential optimization strategies 
(see Section 4.2 for detail). 


its gradient (see, Eqs. fU),©) involve several calls to the cumulative distribu¬ 
tion functions (CDF) of the multivariate normal distribution. The latter CDF 
is computed numerically with the algorithms of m wrapped in the mnormt R 
package [?]• In our implementation, computing this CDF turns out to be the 
main bottleneck in terms of computation time. The total number of calls to 
this CDF (be it in dimension q , q — 1, q — 2 or q — 3) is summarized in Table [T) 
From this table, let us remark that the number of CDF calls does not depend 
on d for the analytical g-EI gradient and is proportional to d for the numerical 
gradient. The use of the analytical gradient is thus expected to bring savings 
when q is not too large compared to d. Figure [2] depicts the ratio of computa¬ 
tion times between numerical and analytical gradient, as a function of q and d. 
These were obtained by averaging the evaluation times of q-EI’s gradient at 10 
randomly-generated batches of size q for a given function in dimension d being 
a sample path of a GP with separable Matern(3/2) covariance function [2TJ. In 
the next section, we use the values q = 6 and d = 5 and we rely exclusively on 
the analytical g-EI formula which we now know to be faster. 
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4.2 Tests 

4.2.1 Experimental setup 

We now compare the performances of two parallel Bayesian optimization algo¬ 
rithm based, respectively, on the UCB approach of [23] and on sequential g-EI 
maximizations. We consider a minimization problem in dimension d = 5 where 
n = 50 evaluations are performed initially and 10 batches of q = 6 observations 
are sequentially added. The objective functions are 50 different sample realiza¬ 
tions of a zero mean GP with unit variance and separable isotropic Matern(3/2) 
covariance function with range parameter equal to one. Both algorithms use the 
same initial design of experiment of n points which are all S-optimal random 
Latin Hypercube designs m- The mean and covariance function of the un¬ 
derlying GP are supposed to be known. Since it is difficult to draw sample 
realizations of the GP on the whole input space D := [0, l] d , we instead draw 
50 samples on a set of 2000 space-filling locations and interpolate each sample 
in order to obtain the 50 objective functions. 

Two variants of the BUCB algorithms are tested. Each of them constructs 
a batch by sequentially minimizing the kriging quantile /z* (a;) — j3 n s n {x) where 
s n (x ) = Vcu x,x) is the posterior standard deviation at step n and /z*(a:) is 
the posterior mean conditioned both on the response at previous points and at 
points already selected in the current batch, with a dummy response fixed to 
their posterior means in the latter case. Following the settings of [10], in the 
first and second variant of BUCB, the coefficients f3 n are given by: 

Pn ] : = 2 ^ mu it log + l ) 2 j and /3^ 2) := 2 /? mu i t log ^^(1 + 9 &) 2 ^ (7) 

where /3 mu it = 0 . 1 , 6 = 0 . 1 , and k is the number of already evaluated batches 
at time n, i.e., here, k G {0,..., 9}. The BUCB1 strategy is expected to select 
locations in regions with low posterior mean (exploitation) while BUCB2 is 
meant to favour more exploration due to a larger /3 n . The minimization of the 
kriging quantile presented above is performed using a genetic algorithm [18]. 
Regarding the algorithm based on g-EI sequential maximization, we propose 
to use a multi-start BFGS algorithm with analytical gradient. This algorithms 
operates gradient descents directly in the space of dimension qd = 30. To limit 
computation time, the number of starting batches in the multi-start is set to 
3. These 3 batches are obtained by running the BUCB1 algorithm presented 
above with 3 different values of /3 mu it equals to 0.05,0.1,0.2 respectively. 

At each iteration, we measure the regrets of each algorithm and average 
them over the 50 experiments. To facilitate the interpretation of results, we 
first focus on the results of the algorithms after 1 iteration, i.e. after having 
added only 1 batch of q points. We then discuss the results when 10 iterations 


are run. 



4.2.2 First step of the optimization 

To start with, we focus on the selection of the first batch. Table [5] compares the 
average g-EI and real improvement obtained for the three selection rules. For 
the first iteration only, the BUCB1 and BUCB2 selection rules are exactly the 
same. Since g-EI is the one-step optimal, it is not a surprise that it performs 


Selection rule 


g-EI 

BUCB 


Average expected 
improvement 

(g-EI) 


0.672 

0.638 


Average realized 
improvement 


0.697 

0.638 


Table 2: Expected and observed first batch Improvement for g-EI and BUCB 
batch selection methods, in average for 50 functions. 


better at iteration 1 with our settings where the objective functions are sample 
realizations of a GP. If only one iteration is performed, improving the g-EI 
is equivalent to improving the average performance. However, we point out 
that, in application, the maximization of g-EI was not straightforward. It turns 
out that the batches proposed by the BUCB algorithms were excellent initial 
candidates in our descent algorithms. The use of other rules for the starting 
batches, with points sampled uniformly or according to a density proportional 
to the one-point El, did not manage to yield this level of performance. 


4.2.3 10 optimization steps 

The average regret of the different batch selection rules over 10 iteration is 
depicted in Fig. [3] This Figure illustrates that choosing the one-step optimal 
criterion is not necessarily optimal if more than one iteration is run [I4j . In¬ 
deed, after two steps, g-EI maximization is already beaten by BUCB2, and g-EI 
becomes better again after iteration 7. Among the 50 optimized functions, g-EI 
maximization gives the smallest 10-steps final regret for only 30% of functions, 
against 52% for the BUCB1 and 18% for the BUCB2. On the other hand, 
the g-EI selection rule is eventually better in average since, for some functions, 
BUCB is beaten by g-EI by a wide margin. This is further illustrated with the 
curve of the 95% quantile of the regret which indicates that, for the worst simu¬ 
lations, g-EI performs better. This gain in robustness alone explains the better 
average performance of g-EI. Such improved performance comes at a price : the 
computational time of our multistart BFGS algorithm with analytical gradient 
is 4.1 times higher compared to the BUCB computation times. 


5 Conclusion 

In this article, we give a closed-form expression of the gradient of the multi¬ 
point Expected Improvement criterion, enabling an efficient g-EI maximization 
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at reduced computational cost. Parallel optimization strategies based on maxi¬ 
mization of g-EI have been tested and are ready to be used on real test case with 
the DiceOptim R package. The BUCB algorithm turns out to be a good com¬ 
petitor to g-EI maximization, with a lower computational cost, and also gives 
good starting batches for the proposed multistart BFGS algorithm. In general, 
however, the maximization of q -El remains a difficult problem. An interesting 
perspective is to develop algorithms taking advantage of some particular proper¬ 
ties of the g-EI function in the space of dimension qd 1 for example its invariance 
to point permutations. Other research perspectives include deriving cheap but 
trustworthy approximations of g-EI and its gradient. Finally, as illustrated in 
the application, g-EI sequential maximizations have no reason to constitute op¬ 
timal decisions for a horizon beyond one batch. Although the optimal policy is 
known [14j . its implementation in practice remains an open problem. 
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6 Appendix : Differential calculus 

• g i and are functions giving respectively the mean of Y’(.X’) and its 
covariance. Each component of these functions is either a linear or a 
quadratic combination of the trend function fi or the covariance function 
C evaluated at different points of X. The results are obtained by matrix 
differentiation. See the appendix B of m for a similar calculus. 

• (73 (resp. ( 74 ) is the affine (resp. linear) tranformation of the mean vector 
m into m^ (resp. the covariance matrix £ into £^). The differentials 
are then expressed in terms of the same linear transformation : 

dm [ 53 ] (h) = L^h and d s [g 4 ] (H) = L™HL^ r . 

• g 5 is defined by g 5 Then the result is ob¬ 

tained by differentiating the univariate Gaussian probability density func¬ 
tion with respect to its mean and variance parameters. Indeed we have 


< ^(m( fe ),s( fc )) [ 55 ] {h, H) — d m (k ) gs(-, (h)-{-d-£(k) < 75 ( 772 ,^, •) (H) 


(k) 

ge gives the mean and the covariance of Z_!\Zi = 0. We have : 


H fe) >4 fc) ) =56 ( m(fe) ’ s(fe) ) 


/ (*) 1 

1= 1 rn^) _ _ j y*(k) y'(^) _ _y'(^) yt(k)T 

\ -i s (fc) -M ’ s (fe) -m -M 
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d (m<*o,E<*o) [fle] (h,H) = d mW g e (fr) + d EW be] (m (/s) ,-) (i?) 


with : cL 


and : d E (k) 


56 (-,EW)] (h) 

g 6 {H) 


— (h hi y (fe) n 

- I _ ufcJ -v ’ u 


■" (fe) 77 

w i Hg (fc) 


(k) 

(fc )2 -M v (fc) ’ 


y,(fe)T 
v (fc )2 -»,* 


i(fe) 


//_, ,-s 


(fe)T 


_y(k) rrT 

-,(fc) »'•« -*>* 


• 37 and 38 : these two functions take a mean vector and a covariance ma¬ 
trix in argument and give a probability in output : S (k) (— m^) = 

37 (m( fe) ,£ (fe) ), $ 5 _ 1 )S (k) (—m^ fc) ) = 38 So, for {p,T,a} = 

{3,£W, —m ( fc )} or {3 — 1, we face the problem of differenti¬ 

ating a function $ : (a,T) —» $ P) r(a), with respect to (a, T) ePx *5++ ; 

d( a , r) [$] ( h , FT) = d a [$(•, T)] (h) + d r [$(a, •)] (#)• 

The the first differential of this sum can be written : 

da [*(•• r )i w = ((sbe*' r )) lSi£r > h )' 

ai a i—l&i+l a p 

with : ^-<f>(a,F) = <p p>r (u-*, a;)dii_j = <pi,r 4 i $ P -i,r 14 (a|i) • 


— OO —(X) —00 —00 


The last equality is obtained with the identity : Vit £ R 9 , Lp q ^(u) = 
<£i,r 44 {ui)<p p -ix^ ( u \i), with u\i = u_i - and T^ = T.^ - 


)i . The second differential is : 


dr [$(0,.)] (H) :=^tr(jT.r^ 7 (a,r) 


ij<pj ^ \ \dcitdaj / i,j< P 


<p2 I E {i ,j } , {i , i} (a:i,a:j)$p-2,E l y(a:| ij ) , if i + j, 


where: a^( a ’ r ) = <^ 


^ = 1 r« x daida 


■$(a,r). 


References 

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the mul¬ 
tiarmed bandit problem. Machine learning , 47(2-3):235-256, 2002. 

[2] A. Azzalini and A. Genz. The R package mnormt: The multivariate normal 
and t distributions (version 1.5-1), 2014. 


11 



[3] J. Beet, D. Ginsbourger, L. Li, V. Picheny, and E. Vazquez. Sequential 
design of computer experiments for the estimation of a probability of failure. 
Statistics and Computing , 22(3):773-793, 2011. 

[4] S. M. Berman. An extension of placketts differential equation for the mul¬ 
tivariate normal density. SIAM Journal on Algebraic Discrete Methods, 
8(2):196-197, 1987. 

[5] E. Brochu, M. Cora, and N. de Freitas. A tutorial on bayesian optimiza¬ 
tion of expensive cost functions, with application to active user modeling 
and hierarchical reinforcement learning, eprint arXiv:1012.2599, arXiv.org, 
December 2010. 

[6] A. Bull. Convergence rates of efficient global optimization algorithms. Jour¬ 
nal of Machine Learning Research , 12:2879-2904, 2011. 

[7] C. Chevalier. Fast uncertainty reduction strategies relying on Gaussian 
process models. PhD thesis, University of Bern, 2013. 

[8] C. Chevalier and D. Ginsbourger. Learning and Intelligent Optimiza¬ 
tion - 7th International Conference, Lion 7, Catania, Italy, January 7-11, 
2013, Revised Selected Papers, chapter fast computation of the multipoint 
expected improvement with applications in batch selection, pages 59-69. 
Springer, 2014. 

[9] D. Ginsbourger and V. Picheny and O. Roustant and with contributions 
by C. Chevalier and S. Marmin and T. Wagner. DiceOptim: Kriging-Based 
Optimization for Computer Experiments, 2015. R package version 1.5. 

[10] T. Desautels, A. Krause, and J. Burdick. Parallelizing exploration- 
exploitation tradeoffs with gaussian process bandit optimization. In ICML, 
2012 . 

[11] P. I. Frazier. Parallel global optimization using an improved multi-points 
expected improvement criterion. In INFORMS Optimization Society Con¬ 
ference, Miami FL, 2012. 

[12] P. I. Frazier, W. B. Powell, and S. Dayanik. A knowledge-gradient pol¬ 
icy for sequential information collection. SIAM Journal on Control and 
Optimization, 47(5):2410-2439, 2008. 

[13] A. Genz. Numerical computation of multivariate normal probabilities. 
Journal of Computational and Graphical Statistics, 1:141-149, 1992. 

[14] D. Ginsbourger and R. Le Riche. Towards gaussian process-based opti¬ 
mization with finite time horizon. In Alessandra Giovagnoli, Anthony C. 
Atkinson, Bernard Torsney, and Caterina May, editors, mODa 9 96 Ad¬ 
vances in Model-Oriented Design and Analysis, Contributions to Statistics, 
pages 89-96. Physica-Verlag HD, 2010. 


12 



[15] D. Ginsbourger, R. Le Riche, and L. Carraro. Kriging is well-suited to 
parallelize optimization. In Computational Intelligence in Expensive Op¬ 
timization Problems, volume 2 of Adaptation Learning and Optimization, 
pages 131-162. Springer, 2010. 

[16] D. R. Jones, M. Sclronlau, and J. William. Efficient global optimization of 
expensive black-box functions. Journal of Global Optimization, 13(4):455- 
492, 1998. 

[17] Q. Y. Kenny, W. Li, and A. Sudjianto. Algorithmic construction of optimal 
symmetric latin liypercube designs. Journal of statistical planning and 
inference, 90(1):145- 159, 2000. 

[18] W. Mebane and J. Sekhon. Genetic optimization using derivatives: The 
rgenoud package for r. Journal of Statistical Software, Vol. 42, Issue 11:1 
26, 2011. 

[19] J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods 
for seeking the extremum. In L. Dixon and Eds G. Szego, editors, Towards 
Global Optimization, volume 2, pages 117 129. Elsevier, 1978. 

[20] C. R. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine 
Learning. MIT Press, 2006. 

[21] O. Roustant, D. Ginsbourger, and Y. Dcville. DiceKriging, DiceOptim: 
Two R packages for the analysis of computer experiments by Kriging-Based 
Metamodelling and Optimization. Journal of Statistical Software, 51 (1):1- 
55, 2012. 

[22] M. Schonlau. Computer Experiments and global optimization. PhD thesis, 
University of Waterloo, 1997. 

[23] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Information-theoretic 
regret bounds for gaussian process optimization in the bandit setting. IEEE 
Transactions on Information Theory, 58(5):3250-3265, 2012. 

[24] E. Vazquez and J. Beet. Convergence properties of the expected improve¬ 
ment algorithm with fixed mean and covariance functions. Journal of Sta¬ 
tistical Planning and inference, 140(ll):3088-3095, 2010. 

[25] J. Villcmonteix, E. Vazquez, and E. Walter. An informational approach 
to the global optimization of expensive-to-evaluate functions. Journal of 
Global Optimization, 44(4):509-534, 2009. 


13 



